Development of a Modern Greek Broadcast-News Corpus and Speech Recognition System
Abstract
We report on the creation of a Modern Greek broadcast-news corpus as a prerequisite for building a large-vocabulary continuous-speech recognition system. We discuss lexical modelling with respect to pronunciation generation and examine the effects of lexicon size on word accuracy. We also discuss the peculiarities of Modern Greek as a highly inflectional language and the challenges they pose for speech recognition.
Similar resources
“My Small Slim Greek ASR System” or Automatic Speech Recognition of Modern Greek Broadcast News
In this paper we report on the development of a Modern Greek large-vocabulary continuous-speech recognition system. We discuss lexical modelling with respect to pronunciation generation and examine its effects on word accuracy. The peculiarities of Modern Greek as a highly inflectional language and the challenges they pose for speech recognition are addressed.
Thai Broadcast News Corpus Construction and Evaluation
Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of speech data while the text corpus was...
Spanish broadcast news transcription
We describe the Sail Labs Media Mining System (MMS), aimed at the transcription of Castilian Spanish broadcast news. In contrast to previous systems, the focus of this system is on Spanish as spoken on the Iberian Peninsula as opposed to the Americas. We discuss the development of a Castilian Spanish broadcast-news corpus suitable for training the various system components of the MMS and report o...
The Slovene BNSI Broadcast News database and reference speech corpus GOS: Towards the uniform guidelines for future work
The aim of the paper is to search for common guidelines for the future development of speech databases for less-resourced languages, in order to make them as useful as possible for both of their main fields of use: linguistic research and speech technologies. We compare two standards for creating speech databases, one followed when developing the Slovene speech database for automatic speech recognition –...
MATBN: A Mandarin Chinese Broadcast News Corpus
The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and r...